Code
import plotly.express as px
import pandas as pdPlease find the code for data collection here
In the initial analysis of our college basketball reddit data we collected data from three subreddits: “CollegeBasketball”, “tarheels”, and “jayhawks”. Before starting more complex topics such as natural language processing and machine learning on this data it is important to understand the data you are working with and in this exploratory data analysis section we do just that. First we take a look at the individual dataframe itself and do some basic cleaning to remove missing and select only relevant columns. Then using data transformation we are able to identify that the most supported teams in the college basketball subreddit include Florida State, Illinois, and Purdue. We also identified some of the most talked about games with the number one obviously being the national championship. Using these findings we dug deeper into the eda.
Our first business goal was to look at how reddit activity changes overtime and an interesting finding was that reddit activity in the comments section of the subreddit is weekly and hour of day cyclical, meaning that activity correlates with when games are played during the season. Furthermore, we also wanted to identify which teams are the most discussed during the season. We found that popular teams like Miami, Kansas, North Carolina, and Purdue were the most discussed. These findings along with the rest of our eda motivated the other analysis in this project.
The first step in any data analysis is to explore the dataframe. In this first section of our eda we will explore the data that we collected which contains submissions and comments data of the r/CollegeBasketball, r/tarheels, and r/jayhawks subreddits between September 2021 and April 2022 (The approximate length of the College Basketball Season). No additional cleaning has already been done on this data. The full code for this analysis can be found here.
The breakdown of the 3 subreddits by number of posts is heavily imbalanced as is expected. In submissions there are just over 22,000 posts and in comments there are just over 1,575,000 comments. The majority of these posts come from the CollegeBasketball subreddit.
The important variables in the submissions dataframe are
subreddit: The Subreddit that the post is inauthor : Who authored the postselftext : The contents of the postcreated_utc : The timestamp of the postThe important variables in the comments dataframe are
subreddit: The Subreddit that the post is inauthor : Who authored the postbody : The contents of the postcreated_utc : The timestamp of the postscore: Upvotes - Downvotes on the commentSince Reddit is a user driven platform there will likely be issues with missing text in posts or deleted posts/comments. In the comments dataframe there are 85,295 data points with no value for the body column and in the submission dataframe there are 12,348 missing values in the selftext column. Once we remove these datapoints the number of datapoints in each subreddit looks as follows:
Now we can look at the preliminary data for the first time.
author_flair_text is a column that contains information about the user icon of the reddit user. In most of these posts this is the team that the author supports. We created a new column that extracts the team information from this column to get a representation of which team a user supports. In submissions the most common icons are the default rcbb (r/collegebasketball) and None (no icon). Ignoring those we see that the most supported teams are Florida State, Illinois, Purdue, and Duke. In comments, None is the most common icon, however after that the most supported teams are Purdue, North Carolina, Illinois, and Kentucky. This shows that some fan bases may be more active posters while others may be more involved in the comments section.
Some posts in the submissions of r/CollegeBasketball are game threads where people can comment on and discuss specific games that are being played. We can extract information about the title to determine the teams playing in the game, their rankings, and the time the game is being played. From here we can determine which games have the most comments and thereby can act as a proxy for which games were the most watched on reddit. We see that the most commented on games involve North Carolina, and further analysis of these games show that the three most popular games are the Elite Eights, Final Four, and Championship games for North Carolina. Additionally we see that all of the most popular games are between ranked teams.
Expanding on this finding that ranked matchups seem to have more interest we did some further analysis to look at the popularity of games over the course of a season and also the difference between ranked and unranked mathcups. Here we see that ranked matchups have significantly higher average number of comment in all months of the season. If we compare only February, ranked matchups have on average 12.5 times the number of comments of unranked matchups. Additionally we see that April and March have the highest number of average comments across both groups, while the earlier months in the season (November and December) have lower average comments per game.
Goal: Determine how reddit activity ebbs and flows over the course of a collegiate basketball season. Construct visuals to aid in understanding how much reddit users post over the course of the 2022 season.
To answer this business goal we first want to simply look at the trend of reddit posts/comments by day over the course of the season in the following figure. Given that the first college basketball game in this season was November 9th, 2021 these counts of comments and submissions per day make sense. We see lower comments and submissions leading up to the season. This is where the two groups deviate. Submissions spike right at the beginning of the season (lots of fans critiquing or praising their new look teams perhaps?), then we see a drop off in submissions as the season settles in, followed by a rise up to the tournament in March, with a quick drop off in the middle of March as many teams are eliminated from contention for the title. Comments, on the other hand are relatively low at the start of the season, and see a rise up until the NCAA tournament in March/April. This indicates that commenting is perhaps more common later in the season as competition because more important and the NCAA tournament gets closer. Submissions are more steady perhaps because these are longer posts that mostly year long fans write.
r/CollegeBasketball subreddit during the 2021-2022 College Basketball Season.We also wanted to look at the average number of words per post/comment to give another angle by how to analyze “reddit activity”. Number of posts is one way to look at it, but it can also be important to see how much time people spend writing these posts and comments. We see this in the figure below. In the comments dataframe we see that the average length each comment is longer before the season starts (November 1st), than during the season (November 1st - April 5th). This is likely because of the lack of posts during the preseason as we saw the in the previous graph. It does say something interesting though that the few people who do comment in the subreddit during the offseason, typically are writing longer comments. In the submissions dataframe we see a different trend. The offseason has fewer or the same average words per submission, however we do see some spikes. After digging into these specific days we identified that these spikes with 600+ words per day occured on days with very few posts (<5) and one of those posts was particularly long. For example on September 2nd, 2021 one user detailed their year long simulation of which college basketball teams were the most lucky, and on October 17th, 2021 another user listed each college basketball team where the first letters of the teams names were swapped (Tar Heels = Har Teels). The spike in March of 2022 for submissions is the same day as the start of the 2022 NCAA Basketball Tournament.
r/CollegeBasketball subreddit during the 2021-2022 College Basketball Season.One final way we want to analyze the reddit activity during the college basketball season is to look at the number of comments as a weekly and daily cyclical trend. College Basketball games are usually played on weeknights and all day on the weekends, lets see if the distribution of comments align with when the games are played. The heatmap below shows the distribution of comments over hour of the day and day of the week. The highest density for total comments are around 9pm Tuesdays and Thursdays and around 2pm Saturdays. In general we also see weekdays having more comments after 5pm, while Sundays and Saturdays have more even distribution througout the day. This aligns with when college basketball games are typically played and indicates that the CollegeBasketball Subreddit is much more active when games are currently being played.
Note: The plot uses Eastern Time as many games are played in EST, but there may be some issues with Central, Mountain, and Pacific time games. Since all games are played within 3 hours of eastern time we did not find this to be a fatal issue, but it must be considered during analysis.
Goals: Explore how the subreddit page of the NCAA champion in 2022, the Kansas Jayhawks, varied over the course of the season. Identify any trends in post frequency or comment length that may correlate to on-the-court successes or failures.
Explore how the subreddit page of the NCAA runner ups in 2022, the UNC Tarheels, varied over the course of the season. Identify any trends in post frequency or comment length that may correlate to on-the-court successes or failures.
In order to analyze these subreddits, we’ll begin by diving into the comment and submission frequency for r/jayhawks.
r/jayhawksThis showed us some interesting things right off the bat about this subreddit. First, the primary thing that sticks out about this graph is that by far and away, the most comments occur in late March and early April. This aligns perfectly with earlier hypotheses that the most interaction on this subreddit would be during the NCAA March Madness tournament.
Taking this a step further, the Jayhawks won the NCAA Championship on 4 April 2022. This aligns perfect with the huge spike - which is likely that championship game. According to Forbes, this was the most-watched NCAA final game in history, so this spike correlates nicely with that!
There is a general “buzz” about the Jayhawks primarily during the collegiate basketball season - November through April. Our data only includes the CBB season, but we can extrapolate from the near-zero comment rate in September that there is not much excitement around the team other than during the season - to include big recruiting commitments.
Before the season, fans identified a number of games that were must watches: December 11, December 21, January 1, January 22, January 29, among others. Although there were potentially smaller spikes in comments during these games, there were vastly outweighed by the increased rate of comments during the finals and the NCAA tournament.
This will help inform our analysis in the future: because the number of comments on this page was so much smaller than on the main r/CollegeBasketball page, our analysis may be more prescient taking Kansas mentions on that main page and using that information instead of focusing on this subreddit.
Let’s do the same thing for submissions and see if we get a similar result.
r/jayhawksThere were so few submissions on the page that the maximum number of submissions in one day was eight! This tells us a lot about how Jayhawk fans react - they don’t do it on this page! Maybe there is more interest on the larger page - maybe there is increased engagement there, and that is preferable for Jayhawk fans. We’ll be able to find that out in the next EDA question.
In order to analyze these subreddits, we’ll begin by diving into the comment and submission frequency for r/tarheels.
r/tarheelsThat’s a good start! This is interesting to compare to the r/jayhawks analysis of before.
The commenting patterns and spike follow the same pattern as the r/jayhawks subreddit. There is minimal engagement before the tournament, with some excitement building in February and March. Then, during the main tournament and especially during the 4 April championship game, there is a huge spike in comments!
A number of games were identified by fans as must-watch throughout the season. The following is an excerpt from the linked article about the games the fandom was most excited for:
North Carolina at Duke, March 5 – It’ll be the swan song at Cameron for Mike Krzyzewski, and it comes, of course, against rival North Carolina. The atmosphere in this one will be insane. Coach K announced he will retire after this season, and his final home game is against the Tar Heels in the ACC regular-season finale. I can’t wait to see what the resale market looks like for this matchup.
Duke vs. Kentucky, Nov. 9 in New York – This is Coach K vs. John Calipari for one final time in the Champions Classic, basically opening the college hoops season. Both come off disappointing campaigns, and both will have a chance to get some early momentum this year. It’ll also feature two of the top frosh in the country: Duke’s Paolo Banchero and Kentucky’s TyTy Washington.
Gonzaga vs. Duke, Nov. 26 in Las Vegas – Two of the best programs, and arguably the two best freshmen, go up against one another at T-Mobile Arena. The Zags have long and skilled 7-footer Chet Holmgren, while Duke will rely heavily on big, strong and athletic forward Paolo Banchero.
Gonzaga vs. UCLA, Nov. 23 in Las Vegas – A rematch of the Final Four matchup in which Jalen Suggs hit the memorable game-winning shot, and both are contenders to cut down the nets this year.
Texas at Gonzaga, Nov. 13 – New Longhorns coach Chris Beard takes his team to Spokane and will face Drew Timme, Chet Holmgren and the top-ranked Zags. It’s a chance for Texas and all of Beard’s transfers to show they are for real.
Villanova at UCLA, Nov. 12 – A matchup of two top-five teams out at Pauley Pavilion. Villanova brought back a pair of fifth-year guys in Collin Gillespie and Jermaine Samuels, while UCLA will be led by March Madness star Johnny Juzang.
Texas at Texas Tech, Feb. 1 – This is one you need to be in Lubbock for to truly appreciate the atmosphere. It won’t exactly be a heartwarming homecoming for Chris Beard. Sure, he led the Red Raiders to a national title game appearance and an Elite Eight, but the fans in Lubbock still can’t come to grips with why he’d leave for UT. In fact, they strongly dislike Beard now. This is his first appearance back in Lubbock, and it’ll be interesting to see the reception he receives from the fan base. I don’t expect a lot of pleasantries.
Kentucky at Kansas, Jan. 29 – It’s Kentucky vs. Kansas in the Big 12/SEC Challenge at Allen Fieldhouse. Need I say more?
As per the r/jayhawks subreddit, although there may have been a spike on these dates, nothing compares whatsoever to the spike around the championship engagement.
Just as before, we’ll do the same thing for submissions.
r/tarheelsWow! There were many fewer submissions on the tarheels subreddit than the jayhawks subreddit. This could, again, be due to a number of reasons, and maybe the engagement on the main page makes up for this difference. We’ll have to see in our next portion of EDA!
We found that many of the questions that arose during our question 3 analysis aligned nicely with the next step of our process: finding which team mentions in the main subreddit occurred the most often. Please see our EDA Question 3 for more information.
import plotly.express as px
import pandas as pdsubmissions_counts_long = pd.read_csv('../data/csv/submissions_counts_long.csv')
comments_counts_long = pd.read_csv('../data/csv/comments_counts_long.csv')| eval: true
# Create the interactive Histogram for Submissions Counts
fig1 = px.bar(submissions_counts_long, x='Teams', y='Mentions', title='Histogram of Team Mentions in Submissions')
fig1.update_layout(
xaxis_title='Teams',
yaxis_title='Mentions',
xaxis={'categoryorder':'total descending'},
title_x=0.5
)
fig1.show()Goal: Identify the teams that have the most engagement on the main college basketball subreddit.
To commence our analysis of the 2022 basketball tournament, it’s crucial to identify key teams for detailed study. Among the 68 participating teams, our focus will be on the Elite 8. These teams, namely Houston, Kansas, Villanova, Duke, Arkansas, Saint Peter’s, North Carolina, and Miami, not only excelled in the tournament but also garnered significant audience interest, as per the 2022 DI Men’s Basketball Championship Official Bracket (https://www.ncaa.com/brackets/basketball-men/d1/2022). Additionally, incorporating insights from author_flair_text analysis, we included Purdue and Illinois, two teams that enjoyed considerable support, to broaden our scope.
The total number of mentions in the submissions dataframe for some of the best teams during the 2021-2022 college basketball season
The graph presents a striking observation: Miami is mentioned markedly more often than other teams, a notably unexpected finding. Additionally, it reveals that the comment mention counts for Kansas, the 2022 champion, North Carolina, the runner-up, and the consistently strong Purdue, significantly exceed those of other teams. This aligns with typical expectations and highlights their prominence in discussions.
The data from the comment counts largely mirrors the trends observed in the previous figure, with a notable exception: Purdue’s mentions significantly surpass those of Kansas and North Carolina, aligning them more closely with Miami’s high mention frequency. This suggests a distinct level of engagement and interest in Purdue, setting it apart from the other teams.
The empirical analysis reveals a correlation between a team’s number of mentions and factors like popularity and season performance. Frequently mentioned teams include top performers like Kansas, the eventual champions, and unexpected contenders like North Carolina, which aligns with our initial expectations. Intriguingly, a team like Miami, despite a lower seed ranking and only reaching the top eight, has garnered more submissions and comment mentions than even the championship team. This anomaly suggests there are additional underlying factors influencing Miami’s high visibility, warranting further investigation to uncover these reasons.
As we progress, our objective is to conduct a detailed analysis of the daily mention trends for each team. This will offer a more nuanced understanding of how audience engagement evolved over the course of the tournament. To optimize our analysis, and informed by our previous findings, we’ve narrowed our focus to Miami, Kansas, North Carolina, and Purdue. These four teams, having attracted a significantly higher volume of mentions, are ideal candidates for our in-depth examination of engagement patterns.
The chart illustrates a distinct trend: From the season’s onset in November 2021 to March 2022, Miami led in commit frequency, surpassing other teams. However, a shift occurred in mid-March 2022, as Miami’s commit frequency began to decline, eventually aligning with the baseline levels seen in other teams. Concurrently, there was a notable surge in submissions related to Kansas and North Carolina, culminating in a peak in early April 2022. This increase is clearly linked to their progression to the finals, reflecting heightened audience interest and engagement as these teams advanced in the tournament.
The data shows that while Miami and Purdue have a higher volume of daily comments, the disparity isn’t as marked when compared to other teams. Both teams experienced a peak in comment activity from mid-to-late March. This trend aligns with key tournament events: Purdue’s journey ended in the top 16 after their loss to St. Peter’s on March 25, while Miami concluded in the top 8 following their defeat by Kansas on March 27. North Carolina’s peak in comment volume around mid-March can be directly attributed to their victory over the previous year’s champions, Baylor. Kansas, on the other hand, saw its comment numbers reach the highest point among all teams in early April 2022, a testament to their ultimate championship victory, drawing substantial audience engagement and discussion.
Overall, the variations in comment volume mirror the dynamics of audience engagement. As the season approaches its climax, particularly the finals, discussions intensify. The disparity in comment counts across different teams highlights the variations in their popularity and performance throughout the season. The data vividly illustrates that audience engagement peaked during critical matches, with this fervor culminating in the finals on April 4, indicating a high level of interest and involvement from fans during these pivotal moments of the tournament.